A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol

ASC & OZCOTS 2023

Weihao (Patrick) Li

Monash University, Australia

✍️Co-authors

Professor Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia

Dr. Emi Tanaka, Biological Data Science Institute, Australian National University, Canberra, Australia

Assistant Professor Susan VanderPlas, Statistics Department, University of Nebraska, Lincoln, USA

📜Literature of Regression Diagnostics

Graphical approaches (plots) are the recommended methods for diagnosing residuals.

  • Draper and Smith (1998) and Belsley, Kuh, and Welsch (1980):

Residual plots are usually revealing when the assumptions are violated.

  • Cook and Weisberg (1982):

Graphical methods are easier to use.

  • Montgomery and Peck (1982):

Residual plots are more informative in most practical situations than the corresponding conventional hypothesis tests.

🤔Challenges in Interpreting Residual Plots

What do you observe from this residual plot?

  • Vertical spread of the points varies with the fitted values.
  • This often indicates the existence of heteroskedasticity.

🤔Challenges in Interpreting Residual Plots

  • However, this is an over-interpretation.

  • The fitted model is correctly specified!

  • The triangle shape is caused by the skewed distribution of the regressors.
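The effect can be reproduced with a small simulation. This is a minimal sketch, not the data behind the slide: the lognormal regressor, the sample size, the coefficients and the seed are all assumptions chosen for illustration. The error variance is constant, yet the residuals span a wider range where the fitted values are dense than where they are sparse, which is what produces the triangle shape.

```python
import numpy as np

rng = np.random.default_rng(2023)  # arbitrary seed, for reproducibility only

n = 300
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # right-skewed regressor (assumed)
y = 1.0 + x + rng.normal(0.0, 1.0, size=n)      # constant error variance

# Fit the correctly specified model y = b0 + b1 * x by least squares.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Split the fitted-value axis at its midpoint: most points fall in the
# left half (dense region), only a few in the right half (sparse region).
midpoint = (fitted.min() + fitted.max()) / 2
dense = resid[fitted <= midpoint]
sparse = resid[fitted > midpoint]


def spread(r):
    """Vertical range covered by a set of residuals."""
    return float(r.max() - r.min())


print("points:", len(dense), "vs", len(sparse))
print("vertical range:", spread(dense), "vs", spread(sparse))
```

The wider range on the dense side comes only from having more draws from the same error distribution, not from a larger variance.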

🔬Visual Inference

The reading of residual plots can be calibrated by an inferential framework called visual inference (Buja et al., 2009).

Typically, a lineup of residual plots consists of

  • one data plot
  • \(19\) null plots containing residuals simulated from the fitted model.
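A lineup of this kind can be assembled as sketched below. The helper `make_lineup` is hypothetical, and it makes one assumption about how the null residuals are generated: new responses are simulated from the fitted model with normal errors and the model is refitted. The study itself may use a different simulation scheme.

```python
import numpy as np


def make_lineup(x, y, m=20, rng=None):
    """Return (panels, data_position) for a residual-plot lineup.

    Each panel is a (fitted values, residuals) pair. Exactly one panel
    holds the observed residuals; the other m - 1 hold null residuals.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))

    panels = []
    for _ in range(m - 1):
        # Assumed null mechanism: simulate from the fitted model, then refit.
        y_null = fitted + rng.normal(0.0, sigma_hat, size=n)
        b_null, *_ = np.linalg.lstsq(X, y_null, rcond=None)
        fitted_null = X @ b_null
        panels.append((fitted_null, y_null - fitted_null))

    # Hide the data plot at a random position among the null plots.
    data_position = int(rng.integers(0, m))
    panels.insert(data_position, (fitted, resid))
    return panels, data_position


# Toy data, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 + x + rng.normal(0.0, 1.0, size=100)
panels, data_position = make_lineup(x, y, m=20, rng=rng)
print(len(panels), "panels; data plot at index", data_position)
```

Each panel can then be drawn as a scatter plot of residuals against fitted values in a grid, with `data_position` withheld from the observer.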

To perform a visual test

  • Observer(s) will be asked to select the most different plot(s).
  • The p-value can be calculated using the beta-binomial model (VanderPlas et al., 2021).
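To give a feel for how a visual p-value arises, the sketch below uses a plain binomial model rather than the beta-binomial model cited above. It assumes each observer makes a single, independent selection and, under the null hypothesis, picks the data plot with probability \(1/20\). The function name is hypothetical.

```python
from math import comb


def binomial_visual_p_value(detections, evaluations, m=20):
    """P(X >= detections) for X ~ Binomial(evaluations, 1/m).

    Simplified stand-in for the beta-binomial calculation: it ignores
    multiple selections and dependence between plots in the same lineup.
    """
    if not 0 <= detections <= evaluations:
        raise ValueError("detections must lie between 0 and evaluations")
    p0 = 1.0 / m
    return sum(
        comb(evaluations, k) * p0**k * (1.0 - p0) ** (evaluations - k)
        for k in range(detections, evaluations + 1)
    )


# One observer who picks the data plot out of 20 gives p = 1/20.
print(binomial_visual_p_value(1, 1))
# Three detections out of five evaluations is much stronger evidence.
print(binomial_visual_p_value(3, 5))
```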

⚔️Conventional vs. Visual

To understand why regression experts consistently recommend plotting residuals for regression diagnostics, we conducted an experiment to compare conventional hypothesis testing with visual testing.

🧪Experimental Design

Non-linearity model:

\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{z} + \boldsymbol{\varepsilon},~ \boldsymbol{z} \propto He_j(\boldsymbol{x}) \text{ and } \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n),\]

where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), \(\boldsymbol{1}_n\) is a vector of ones of size \(n\), and \(He_{j}(\cdot)\) is the \(j\)th-order probabilist’s Hermite polynomial.

Null regression model:

\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
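The following sketch simulates from the non-linearity model and fits the null regression model. Several details are assumptions made for illustration: \(\boldsymbol{x}\) is drawn from a standard normal distribution, \(\boldsymbol{z}\) is \(He_j(\boldsymbol{x})\) rescaled to unit standard deviation, and the values of \(n\), \(j\) and \(\sigma\) are arbitrary. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import hermite_e


def simulate_nonlinearity(n, j, sigma, rng):
    """Draw (x, y, z) from y = 1 + x + z + eps with z proportional to He_j(x)."""
    if j < 1:
        raise ValueError("j must be at least 1")
    x = rng.normal(0.0, 1.0, size=n)       # assumed distribution of x
    coef = np.zeros(j + 1)
    coef[j] = 1.0                          # selects He_j in the series
    raw = hermite_e.hermeval(x, coef)      # probabilist's Hermite polynomial
    z = raw / raw.std()                    # assumed scaling constant
    y = 1.0 + x + z + rng.normal(0.0, sigma, size=n)
    return x, y, z


def null_model_residuals(x, y):
    """Fit y = b0 + b1 * x by least squares; return fitted values and residuals."""
    X = np.column_stack([np.ones(len(x)), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    return fitted, y - fitted


rng = np.random.default_rng(7)
x, y, z = simulate_nonlinearity(n=300, j=2, sigma=0.5, rng=rng)
fitted, resid = null_model_residuals(x, y)

# The omitted term z ends up in the residuals of the null model.
print(np.corrcoef(resid, z)[0, 1])
```

Plotting `resid` against `fitted` gives the data plot for this scenario; a larger `sigma` makes the omitted term harder to see.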

🧪Experimental Design

Heteroskedasticity model:

\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{\varepsilon},~ \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, 1 + (2 - |a|)(\boldsymbol{x} - a)^2b \boldsymbol{I}_n),\]

where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), and \(\boldsymbol{1}_n\) is a vector of ones of size \(n\).

Null regression model:

\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
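The variance function in the heteroskedasticity model can be explored with the sketch below. It reads the variance elementwise, so observation \(i\) has error variance \(1 + (2 - |a|)(x_i - a)^2 b\). The uniform distribution of \(\boldsymbol{x}\) and the values of \(n\), \(a\) and \(b\) are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np


def error_variance(x, a, b):
    """Elementwise error variance 1 + (2 - |a|) * (x - a)^2 * b.

    Non-negative whenever |a| <= 2 and b >= 0.
    """
    return 1.0 + (2.0 - abs(a)) * (x - a) ** 2 * b


def simulate_heteroskedasticity(n, a, b, rng):
    """Draw (x, y) from y = 1 + x + eps with the variance function above."""
    x = rng.uniform(-1.0, 1.0, size=n)     # assumed distribution of x
    sd = np.sqrt(error_variance(x, a, b))
    y = 1.0 + x + rng.normal(0.0, 1.0, size=n) * sd
    return x, y


rng = np.random.default_rng(11)
x, y = simulate_heteroskedasticity(n=300, a=-1.0, b=16.0, rng=rng)

# The variance is smallest at x = a and grows with the distance from a.
print(error_variance(np.array([-1.0, 0.0, 1.0]), a=-1.0, b=16.0))
```

Under this reading, \(a\) sets where the variance is smallest and \(b\) controls how quickly it grows; \(b = 0\) recovers the homoskedastic model.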

💪Power of Visual Tests

We use logistic regression to estimate the power:

\[\Pr(\text{reject}~H_0|H_1,E) = \Lambda\left(\log\left(\frac{0.05}{0.95}\right) + \beta_1 E\right),\]

where \(\Lambda(\cdot)\) is the standard logistic function given as \(\Lambda(z) = \exp(z)/(1+\exp(z))\).

  • The effect size \(E\), derived from the Kullback-Leibler divergence (Kullback and Leibler, 1951), is the only predictor.

  • The intercept is fixed at \(\log(0.05/0.95)\) so that \(\widehat{\Pr}(\text{reject}~H_0|H_1,E = 0) = 0.05\).
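The power model can be sketched in a few lines. Because the intercept is fixed, \(\beta_1\) is the only free parameter; here it is estimated by maximum likelihood, using bisection on the score function. The function names and the toy data are hypothetical and are not results from the experiment.

```python
import math

INTERCEPT = math.log(0.05 / 0.95)  # fixed so that the power at E = 0 is 0.05


def logistic(z):
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def power(effect_size, beta1):
    """Pr(reject H0 | H1, E) under the fixed-intercept logistic model."""
    return logistic(INTERCEPT + beta1 * effect_size)


def score(beta1, effect_sizes, rejections):
    """Derivative of the Bernoulli log-likelihood with respect to beta1."""
    return sum((r - power(e, beta1)) * e for e, r in zip(effect_sizes, rejections))


def fit_beta1(effect_sizes, rejections, lo=-20.0, hi=20.0, iters=200):
    """Solve score(beta1) = 0 by bisection; the score is decreasing in beta1."""
    if score(lo, effect_sizes, rejections) < 0 or score(hi, effect_sizes, rejections) > 0:
        raise ValueError("maximum likelihood estimate lies outside [lo, hi]")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid, effect_sizes, rejections) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy data, for illustration only: 1 means the test rejected.
effect_sizes = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
rejections = [0, 0, 1, 0, 1, 1]
beta1_hat = fit_beta1(effect_sizes, rejections)
print(beta1_hat, power(0.0, beta1_hat), power(3.0, beta1_hat))
```

Bisection is used instead of Newton's method because Newton steps can overshoot badly on small samples like this one; any generalized linear model routine that supports a fixed offset would serve the same purpose.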

🛠️Experimental Setup

Prolific (Palan and Schitter, 2018):

  • 7974 evaluations
  • 1152 unique lineups
  • 443 subjects

Every subject was asked to:

  • Evaluate a block of 20 lineups.
  • Select one or more plots that are most different from others.

⚖️Main Results: Power Comparison of Conventional Tests and Visual Tests

⚖️Non-linearity Patterns

⚖️Heteroskedasticity Patterns

The visual test rejects less frequently than the conventional test, and (almost) only rejects when the conventional test does.

🌟An example of conventional tests being too sensitive

Data plot (No.1):

  • indistinguishable from null plots
  • extremely small effect size (\(\log_e(E) = -0.48\))
  • non-linearity pattern (S-shape) is totally undetectable

The RESET test rejects the null hypothesis (\(p\text{-value} = 0.004\)).

The visual test gives a more practical \(p\text{-value}\) of \(0.813\).

🧐Main Conclusions

  1. Conventional tests are more sensitive to weak departures.

  2. Conventional tests often reject when departures are not visibly different from null residual plots.

  3. In these cases, visual tests provide a more practical solution.

  4. Regression experts are right. Residual plots are an indispensable tool for assessing model fit.

Thanks! Any questions?